SWSP Project - WESAD Dataset EDA

Introduction

In this notebook we will explore the extracted features from the WESAD dataset.

%reload_ext pretty_jupyter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pretty_jupyter.helpers import matplotlib_fig_to_markdown

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

Now we will import the data:

data = pd.read_csv('data/m14_merged.csv')
data.head()
Unnamed: 0 net_acc_mean net_acc_std net_acc_min net_acc_max EDA_phasic_mean EDA_phasic_std EDA_phasic_min EDA_phasic_max EDA_smna_mean EDA_smna_std EDA_smna_min EDA_smna_max EDA_tonic_mean EDA_tonic_std EDA_tonic_min EDA_tonic_max BVP_mean BVP_std BVP_min BVP_max TEMP_mean TEMP_std TEMP_min TEMP_max ACC_x_mean ACC_x_std ACC_x_min ACC_x_max ACC_y_mean ACC_y_std ACC_y_min ACC_y_max ACC_z_mean ACC_z_std ACC_z_min ACC_z_max Resp_mean Resp_std Resp_min Resp_max 0_mean 0_std 0_min 0_max BVP_peak_freq TEMP_slope subject label age height weight gender_ female gender_ male coffee_today_YES sport_today_YES smoker_NO smoker_YES feel_ill_today_YES
0 0 1.397968 0.141481 1.109299 1.678399 1.824289 1.088328 0.367977 4.319987 1.284376 1.952823 5.229656e-08 11.712596 1.232164 0.997487 -0.599164 2.554750 -0.181673 107.648359 -358.13 554.77 35.817091 0.012674 35.79 35.84 0.029510 0.011145 -0.024082 0.087383 0.000020 0.000008 -0.000017 0.000060 0.000020 0.000008 -0.000017 0.000060 0.148184 2.935617 -8.805847 6.504822 0.029937 0.009942 0.000000 0.087383 0.135670 -0.000169 2 1 27 175 80 0 1 0 0 1 0 0
1 1 1.210132 0.091882 1.014138 1.485800 2.109146 1.223528 0.539150 4.459367 1.467865 2.852510 3.096902e-08 17.418821 0.377615 1.172221 -1.213173 1.871490 -0.830147 118.742089 -392.28 438.16 35.797568 0.029901 35.75 35.87 0.017352 0.020817 -0.037843 0.071558 0.000012 0.000014 -0.000026 0.000049 0.000012 0.000014 -0.000026 0.000049 0.037545 2.843123 -8.168030 6.742859 0.021986 0.015845 0.000000 0.071558 0.095023 -0.000789 2 1 27 175 80 0 1 0 0 1 0 0
2 2 1.010977 0.102315 0.832216 1.190967 0.152828 0.128896 0.006950 0.544346 0.105091 0.244891 4.725602e-08 1.300810 1.727696 0.293389 1.137304 2.037179 0.939683 42.190039 -240.61 209.89 35.712909 0.027612 35.66 35.75 0.020839 0.011034 0.002752 0.054356 0.000014 0.000008 0.000002 0.000037 0.000014 0.000008 0.000002 0.000037 -0.021862 1.700333 -2.914429 3.260803 0.020839 0.011034 0.002752 0.054356 0.076880 -0.000717 2 1 27 175 80 0 1 0 0 1 0 0
3 3 0.775187 0.046391 0.693996 0.876819 0.177595 0.126167 0.002789 0.361388 0.110786 0.199704 2.787285e-08 1.105898 0.987927 0.042388 0.912441 1.127602 0.107404 41.606872 -289.26 145.36 35.700811 0.019504 35.66 35.73 0.034449 0.003185 0.013761 0.040595 0.000024 0.000002 0.000009 0.000028 0.000024 0.000002 0.000009 0.000028 0.097563 1.483260 -2.818298 3.730774 0.034449 0.003185 0.013761 0.040595 0.140271 0.000075 2 1 27 175 80 0 1 0 0 1 0 0
4 4 0.657494 0.034540 0.594667 0.718106 0.035014 0.039616 0.001144 0.132781 0.026716 0.114738 5.174645e-08 0.997037 0.772262 0.077628 0.615685 0.907833 -0.073620 43.121633 -197.37 194.12 35.744727 0.019386 35.71 35.79 0.028870 0.004379 0.013761 0.038531 0.000020 0.000003 0.000009 0.000027 0.000020 0.000003 0.000009 0.000027 0.062545 1.501585 -3.242493 2.912903 0.028870 0.004379 0.013761 0.038531 0.149321 0.000442 2 1 27 175 80 0 1 0 0 1 0 0
data.shape
(1178, 59)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1178 entries, 0 to 1177
Data columns (total 59 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          1178 non-null   int64  
 1   net_acc_mean        1178 non-null   float64
 2   net_acc_std         1178 non-null   float64
 3   net_acc_min         1178 non-null   float64
 4   net_acc_max         1178 non-null   float64
 5   EDA_phasic_mean     1178 non-null   float64
 6   EDA_phasic_std      1178 non-null   float64
 7   EDA_phasic_min      1178 non-null   float64
 8   EDA_phasic_max      1178 non-null   float64
 9   EDA_smna_mean       1178 non-null   float64
 10  EDA_smna_std        1178 non-null   float64
 11  EDA_smna_min        1178 non-null   float64
 12  EDA_smna_max        1178 non-null   float64
 13  EDA_tonic_mean      1178 non-null   float64
 14  EDA_tonic_std       1178 non-null   float64
 15  EDA_tonic_min       1178 non-null   float64
 16  EDA_tonic_max       1178 non-null   float64
 17  BVP_mean            1178 non-null   float64
 18  BVP_std             1178 non-null   float64
 19  BVP_min             1178 non-null   float64
 20  BVP_max             1178 non-null   float64
 21  TEMP_mean           1178 non-null   float64
 22  TEMP_std            1178 non-null   float64
 23  TEMP_min            1178 non-null   float64
 24  TEMP_max            1178 non-null   float64
 25  ACC_x_mean          1178 non-null   float64
 26  ACC_x_std           1178 non-null   float64
 27  ACC_x_min           1178 non-null   float64
 28  ACC_x_max           1178 non-null   float64
 29  ACC_y_mean          1178 non-null   float64
 30  ACC_y_std           1178 non-null   float64
 31  ACC_y_min           1178 non-null   float64
 32  ACC_y_max           1178 non-null   float64
 33  ACC_z_mean          1178 non-null   float64
 34  ACC_z_std           1178 non-null   float64
 35  ACC_z_min           1178 non-null   float64
 36  ACC_z_max           1178 non-null   float64
 37  Resp_mean           1178 non-null   float64
 38  Resp_std            1178 non-null   float64
 39  Resp_min            1178 non-null   float64
 40  Resp_max            1178 non-null   float64
 41  0_mean              1178 non-null   float64
 42  0_std               1178 non-null   float64
 43  0_min               1178 non-null   float64
 44  0_max               1178 non-null   float64
 45  BVP_peak_freq       1178 non-null   float64
 46  TEMP_slope          1178 non-null   float64
 47  subject             1178 non-null   int64  
 48  label               1178 non-null   int64  
 49  age                 1178 non-null   int64  
 50  height              1178 non-null   int64  
 51  weight              1178 non-null   int64  
 52  gender_ female      1178 non-null   int64  
 53  gender_ male        1178 non-null   int64  
 54  coffee_today_YES    1178 non-null   int64  
 55  sport_today_YES     1178 non-null   int64  
 56  smoker_NO           1178 non-null   int64  
 57  smoker_YES          1178 non-null   int64  
 58  feel_ill_today_YES  1178 non-null   int64  
dtypes: float64(46), int64(13)
memory usage: 543.1 KB

We can observe that all the data is numeric and there are no missing values. We will remove the first column (`Unnamed: 0`) as it is just a duplicate of the index.

data = data.drop([data.columns[0]], axis=1)
data.head()
net_acc_mean net_acc_std net_acc_min net_acc_max EDA_phasic_mean EDA_phasic_std EDA_phasic_min EDA_phasic_max EDA_smna_mean EDA_smna_std EDA_smna_min EDA_smna_max EDA_tonic_mean EDA_tonic_std EDA_tonic_min EDA_tonic_max BVP_mean BVP_std BVP_min BVP_max TEMP_mean TEMP_std TEMP_min TEMP_max ACC_x_mean ACC_x_std ACC_x_min ACC_x_max ACC_y_mean ACC_y_std ACC_y_min ACC_y_max ACC_z_mean ACC_z_std ACC_z_min ACC_z_max Resp_mean Resp_std Resp_min Resp_max 0_mean 0_std 0_min 0_max BVP_peak_freq TEMP_slope subject label age height weight gender_ female gender_ male coffee_today_YES sport_today_YES smoker_NO smoker_YES feel_ill_today_YES
0 1.397968 0.141481 1.109299 1.678399 1.824289 1.088328 0.367977 4.319987 1.284376 1.952823 5.229656e-08 11.712596 1.232164 0.997487 -0.599164 2.554750 -0.181673 107.648359 -358.13 554.77 35.817091 0.012674 35.79 35.84 0.029510 0.011145 -0.024082 0.087383 0.000020 0.000008 -0.000017 0.000060 0.000020 0.000008 -0.000017 0.000060 0.148184 2.935617 -8.805847 6.504822 0.029937 0.009942 0.000000 0.087383 0.135670 -0.000169 2 1 27 175 80 0 1 0 0 1 0 0
1 1.210132 0.091882 1.014138 1.485800 2.109146 1.223528 0.539150 4.459367 1.467865 2.852510 3.096902e-08 17.418821 0.377615 1.172221 -1.213173 1.871490 -0.830147 118.742089 -392.28 438.16 35.797568 0.029901 35.75 35.87 0.017352 0.020817 -0.037843 0.071558 0.000012 0.000014 -0.000026 0.000049 0.000012 0.000014 -0.000026 0.000049 0.037545 2.843123 -8.168030 6.742859 0.021986 0.015845 0.000000 0.071558 0.095023 -0.000789 2 1 27 175 80 0 1 0 0 1 0 0
2 1.010977 0.102315 0.832216 1.190967 0.152828 0.128896 0.006950 0.544346 0.105091 0.244891 4.725602e-08 1.300810 1.727696 0.293389 1.137304 2.037179 0.939683 42.190039 -240.61 209.89 35.712909 0.027612 35.66 35.75 0.020839 0.011034 0.002752 0.054356 0.000014 0.000008 0.000002 0.000037 0.000014 0.000008 0.000002 0.000037 -0.021862 1.700333 -2.914429 3.260803 0.020839 0.011034 0.002752 0.054356 0.076880 -0.000717 2 1 27 175 80 0 1 0 0 1 0 0
3 0.775187 0.046391 0.693996 0.876819 0.177595 0.126167 0.002789 0.361388 0.110786 0.199704 2.787285e-08 1.105898 0.987927 0.042388 0.912441 1.127602 0.107404 41.606872 -289.26 145.36 35.700811 0.019504 35.66 35.73 0.034449 0.003185 0.013761 0.040595 0.000024 0.000002 0.000009 0.000028 0.000024 0.000002 0.000009 0.000028 0.097563 1.483260 -2.818298 3.730774 0.034449 0.003185 0.013761 0.040595 0.140271 0.000075 2 1 27 175 80 0 1 0 0 1 0 0
4 0.657494 0.034540 0.594667 0.718106 0.035014 0.039616 0.001144 0.132781 0.026716 0.114738 5.174645e-08 0.997037 0.772262 0.077628 0.615685 0.907833 -0.073620 43.121633 -197.37 194.12 35.744727 0.019386 35.71 35.79 0.028870 0.004379 0.013761 0.038531 0.000020 0.000003 0.000009 0.000027 0.000020 0.000003 0.000009 0.000027 0.062545 1.501585 -3.242493 2.912903 0.028870 0.004379 0.013761 0.038531 0.149321 0.000442 2 1 27 175 80 0 1 0 0 1 0 0
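As a side note, the duplicated index column could be avoided at load time by telling pandas to treat the first CSV column as the index. A minimal sketch using an inline toy CSV (a stand-in for `data/m14_merged.csv`):

```python
import io
import pandas as pd

# toy stand-in for data/m14_merged.csv: the first column is a saved index
csv = "Unnamed: 0,a,b\n0,1.0,2.0\n1,3.0,4.0\n"

# index_col=0 turns the "Unnamed: 0" column into the index,
# so no drop is needed afterwards
df = pd.read_csv(io.StringIO(csv), index_col=0)
print(df.columns.tolist())  # → ['a', 'b']
```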

Now we will explore the data. We will start by looking at the distribution of the features.

data.describe()
net_acc_mean net_acc_std net_acc_min net_acc_max EDA_phasic_mean EDA_phasic_std EDA_phasic_min EDA_phasic_max EDA_smna_mean EDA_smna_std EDA_smna_min EDA_smna_max EDA_tonic_mean EDA_tonic_std EDA_tonic_min EDA_tonic_max BVP_mean BVP_std BVP_min BVP_max TEMP_mean TEMP_std TEMP_min TEMP_max ACC_x_mean ACC_x_std ACC_x_min ACC_x_max ACC_y_mean ACC_y_std ACC_y_min ACC_y_max ACC_z_mean ACC_z_std ACC_z_min ACC_z_max Resp_mean Resp_std Resp_min Resp_max 0_mean 0_std 0_min 0_max BVP_peak_freq TEMP_slope subject label age height weight gender_ female gender_ male coffee_today_YES sport_today_YES smoker_NO smoker_YES feel_ill_today_YES
count 1178.000000 1178.000000 1178.000000 1178.000000 1.178000e+03 1.178000e+03 1.178000e+03 1.178000e+03 1.178000e+03 1.178000e+03 1.178000e+03 1.178000e+03 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1.178000e+03 1178.000000 1178.000000 1178.000000 1.178000e+03 1178.000000 1178.000000 1178.000000 1.178000e+03 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000 1178.000000
mean 1.968314 0.042767 1.882383 2.063071 1.723747e-01 9.033242e-02 5.006042e-02 3.701452e-01 1.315482e-01 2.487189e-01 7.136587e-08 1.475279e+00 -0.036642 0.084926 -0.185936 0.082993 -0.003669 49.059501 -229.518574 201.903862 33.010021 0.020199 32.972649 33.046859 0.011482 3.928262e-03 -0.004006 0.026907 0.000008 2.702853e-06 -0.000003 0.000019 0.000008 2.702853e-06 -0.000003 0.000019 0.053202 3.230961 -7.195496 8.485843 0.029583 0.003686 0.017335 0.045410 0.128328 -0.000015 9.395586 1.135823 27.479626 177.598472 73.057725 0.202886 0.797114 0.267402 0.133277 0.933786 0.066214 0.066214
std 2.661351 0.074763 2.534358 2.768385 6.003388e-01 3.856268e-01 1.644202e-01 1.301680e+00 4.890550e-01 6.814781e-01 5.421640e-08 4.462437e+00 1.244990 0.365706 1.563632 1.191452 0.953560 42.065002 219.592115 191.928814 1.469943 0.012848 1.468264 1.472664 0.028788 4.239653e-03 0.033927 0.032278 0.000020 2.917107e-06 0.000023 0.000022 0.000020 2.917107e-06 0.000023 0.000022 0.202373 1.623139 4.625907 5.486392 0.009536 0.003703 0.013342 0.017915 0.040534 0.000607 4.709366 0.669945 2.367654 6.549924 9.941378 0.402319 0.402319 0.442792 0.340018 0.248761 0.248761 0.248761
min 0.090000 0.000660 0.074363 0.096987 1.161183e-07 1.183557e-08 6.445254e-08 1.693472e-07 8.149629e-08 1.439258e-08 3.480068e-09 1.401867e-07 -12.916454 0.000130 -25.222599 -3.674357 -7.665458 2.820251 -1617.860000 7.270000 29.370901 0.006785 29.330000 29.410000 -0.044545 6.938894e-18 -0.088071 -0.041971 -0.000031 0.000000e+00 -0.000061 -0.000029 -0.000031 0.000000e+00 -0.000061 -0.000029 -1.031070 0.235230 -50.000000 0.572205 0.000603 0.000000 0.000000 0.001376 0.027134 -0.003532 2.000000 0.000000 24.000000 165.000000 54.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.308544 0.003338 0.296209 0.320464 3.658714e-03 3.811133e-03 2.201203e-05 1.481163e-02 2.681601e-03 1.228424e-02 3.406857e-08 8.813488e-02 -0.784974 0.006775 -0.857377 -0.756961 -0.238670 20.719236 -307.515000 64.827500 32.271545 0.013295 32.230000 32.310000 -0.020547 6.107677e-04 -0.030962 -0.003268 -0.000014 4.202407e-07 -0.000021 -0.000002 -0.000014 4.202407e-07 -0.000021 -0.000002 -0.027201 2.061239 -8.803940 4.637146 0.022995 0.000611 0.004816 0.033715 0.099548 -0.000303 5.000000 1.000000 26.000000 172.000000 66.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000
50% 0.867280 0.013109 0.786439 0.928208 1.937642e-02 1.729287e-02 7.609050e-04 6.490229e-02 1.568327e-02 5.180896e-02 5.590956e-08 3.510869e-01 -0.489699 0.023554 -0.515918 -0.450741 0.006632 37.028434 -156.660000 147.105000 33.158964 0.016356 33.130000 33.210000 0.024364 2.207785e-03 0.004816 0.036467 0.000017 1.519073e-06 0.000003 0.000025 0.000017 1.519073e-06 0.000003 0.000025 0.053004 2.853950 -5.764771 6.906128 0.030772 0.002198 0.015825 0.044035 0.131148 -0.000057 9.000000 1.000000 27.000000 178.000000 75.000000 0.000000 1.000000 0.000000 0.000000 1.000000 0.000000 0.000000
75% 2.657175 0.047663 2.525427 2.830681 1.599508e-01 8.115187e-02 2.208239e-02 3.826789e-01 1.250409e-01 2.877148e-01 8.918624e-08 1.667771e+00 0.767034 0.068806 0.613979 0.919402 0.226479 64.278928 -75.865000 272.857500 34.008414 0.021719 33.970000 34.050000 0.036412 6.162301e-03 0.024082 0.049540 0.000025 4.239991e-06 0.000017 0.000034 0.000025 4.239991e-06 0.000017 0.000034 0.131744 3.940348 -4.020691 10.527420 0.037539 0.006049 0.028898 0.053668 0.149321 0.000188 14.000000 2.000000 28.000000 184.000000 80.000000 0.000000 1.000000 1.000000 0.000000 1.000000 0.000000 0.000000
max 15.638741 0.840366 15.172502 15.931444 1.471653e+01 1.001923e+01 2.691494e+00 2.963154e+01 1.298959e+01 1.643416e+01 3.625345e-07 1.172344e+02 3.053079 9.430388 3.006969 3.291220 8.145138 343.049267 -9.280000 1789.000000 35.929091 0.113687 35.910000 35.970000 0.043380 2.756758e-02 0.043347 0.087383 0.000030 1.896796e-05 0.000030 0.000060 0.000030 1.896796e-05 0.000030 0.000060 1.194413 10.890547 -0.923157 37.886047 0.044545 0.020700 0.043347 0.088071 0.298474 0.003132 17.000000 2.000000 35.000000 189.000000 90.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
def plot_distribution(data, nbrRows=12, nbrCols=5, figsize=(20, 30)):
    plt.figure(figsize=figsize)
    # skip the last 12 columns (subject, label and the demographic/one-hot features)
    for c in range(len(data.columns) - 12):
        plt.subplot(nbrRows, nbrCols, c + 1)
        plt.hist(data.iloc[:, c], bins=50)
        plt.title(data.columns[c])
    plt.tight_layout()
    plt.show()
plot_distribution(data)

We can observe that the net_acc, EDA_phasic and EDA_smna features are heavily right-skewed (most of their mass sits near zero, with a long right tail), and there may be some outliers. Furthermore, looking at the acceleration features, the readings from the three axes are very similar, so it would be a good idea to explore them further. To get a better idea of how the features correlate with each other, we will plot a correlation matrix.
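The visual impression of skew can be quantified with pandas' built-in skewness statistic (positive values indicate a long right tail). A minimal sketch on synthetic columns shaped roughly like the features here, not the WESAD data itself:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-ins: an exponential column mimics the long right tail of
# EDA_phasic/EDA_smna, a normal column mimics a roughly symmetric feature
df = pd.DataFrame({
    "eda_like": rng.exponential(scale=0.2, size=1000),
    "temp_like": rng.normal(loc=33.0, scale=1.5, size=1000),
})
skew = df.skew().sort_values(ascending=False)
print(skew)
```

On the real data, `data.skew().sort_values(ascending=False)` would surface the most skewed features; those with skewness well above 1 are candidates for a log transform.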

# plot a correlation matrix
corr = data.corr()
mask = np.triu(np.ones_like(data.corr()))
plt.figure(figsize=(40, 40))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', mask=mask)
plt.show()

We can observe that some parts of the matrix are more saturated than others; these spots indicate features with a higher correlation to each other, so we will explore them further. We will generally ignore high correlations between the standard deviation, minimum and maximum of the same signal, as those are expected. Keeping these factors in mind we can conclude that:

  • All the EDA features are highly correlated with each other and with the other variables, except for EDA_smna_min.
  • The net_acc features are highly correlated with the other variables.
  • The ACC features are highly correlated with the other variables, and we can confirm our earlier observation that they are also highly correlated with each other.
  • BVP and Resp have a very high correlation with other features, with the exception of their means, which show almost no correlation at all.
  • The TEMP min, max and mean features have a low correlation with the other variables and are nearly identical to each other, so they can be combined into one feature. The standard deviation and the slope of the temperature, on the other hand, have a very low correlation with the other variables, so they can be removed.

For the categorical features, the only useful correlations are between the age, height and weight features and the physiological and motion features, so we can safely remove the rest. Before we do that, it is worth noting the obvious correlations between subject and the TEMP features, between gender and acceleration, between sport and acceleration, and of course between gender and height/weight.
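The saturated spots can also be extracted programmatically rather than read off the heatmap. A sketch (demonstrated on a toy frame, not the WESAD features) that lists all pairs whose absolute correlation exceeds a threshold:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs with |correlation| above `threshold`."""
    corr = df.corr().abs()
    # keep only the upper triangle (k=1) so each pair appears exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()  # NaNs (lower triangle and diagonal) are dropped
    return pairs[pairs > threshold].sort_values(ascending=False)

# toy example: y is a noisy copy of x, z is independent of both
rng = np.random.default_rng(1)
x = rng.normal(size=500)
toy = pd.DataFrame({"x": x,
                    "y": x + 0.05 * rng.normal(size=500),
                    "z": rng.normal(size=500)})
print(high_corr_pairs(toy, threshold=0.9))
```

On the real data, `high_corr_pairs(data.select_dtypes('number'))` would reproduce the pairs discussed above as a sorted list.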

Now we will remove the features that won't be of use to us or that overlap with other features. The columns in question are: EDA_smna_min, BVP_mean, TEMP_std, TEMP_slope, Resp_mean, smoker_YES, smoker_NO, feel_ill_today_YES, sport_today_YES, coffee_today_YES, gender_ male and gender_ female. Furthermore, we will keep only the TEMP_mean feature and remove TEMP_min and TEMP_max, in addition to merging all the acceleration features into one.

columns_to_remove = ['EDA_smna_min', 'BVP_mean', 'TEMP_std', 'TEMP_slope', 'Resp_mean', 'smoker_YES', 'smoker_NO', 'feel_ill_today_YES', 'sport_today_YES', 'coffee_today_YES', 'gender_ male', 'gender_ female', 'TEMP_min', 'TEMP_max']
data_m1 = data.drop(columns_to_remove, axis=1)
data_m1.head()
net_acc_mean net_acc_std net_acc_min net_acc_max EDA_phasic_mean EDA_phasic_std EDA_phasic_min EDA_phasic_max EDA_smna_mean EDA_smna_std EDA_smna_max EDA_tonic_mean EDA_tonic_std EDA_tonic_min EDA_tonic_max BVP_std BVP_min BVP_max TEMP_mean ACC_x_mean ACC_x_std ACC_x_min ACC_x_max ACC_y_mean ACC_y_std ACC_y_min ACC_y_max ACC_z_mean ACC_z_std ACC_z_min ACC_z_max Resp_std Resp_min Resp_max 0_mean 0_std 0_min 0_max BVP_peak_freq subject label age height weight
0 1.397968 0.141481 1.109299 1.678399 1.824289 1.088328 0.367977 4.319987 1.284376 1.952823 11.712596 1.232164 0.997487 -0.599164 2.554750 107.648359 -358.13 554.77 35.817091 0.029510 0.011145 -0.024082 0.087383 0.000020 0.000008 -0.000017 0.000060 0.000020 0.000008 -0.000017 0.000060 2.935617 -8.805847 6.504822 0.029937 0.009942 0.000000 0.087383 0.135670 2 1 27 175 80
1 1.210132 0.091882 1.014138 1.485800 2.109146 1.223528 0.539150 4.459367 1.467865 2.852510 17.418821 0.377615 1.172221 -1.213173 1.871490 118.742089 -392.28 438.16 35.797568 0.017352 0.020817 -0.037843 0.071558 0.000012 0.000014 -0.000026 0.000049 0.000012 0.000014 -0.000026 0.000049 2.843123 -8.168030 6.742859 0.021986 0.015845 0.000000 0.071558 0.095023 2 1 27 175 80
2 1.010977 0.102315 0.832216 1.190967 0.152828 0.128896 0.006950 0.544346 0.105091 0.244891 1.300810 1.727696 0.293389 1.137304 2.037179 42.190039 -240.61 209.89 35.712909 0.020839 0.011034 0.002752 0.054356 0.000014 0.000008 0.000002 0.000037 0.000014 0.000008 0.000002 0.000037 1.700333 -2.914429 3.260803 0.020839 0.011034 0.002752 0.054356 0.076880 2 1 27 175 80
3 0.775187 0.046391 0.693996 0.876819 0.177595 0.126167 0.002789 0.361388 0.110786 0.199704 1.105898 0.987927 0.042388 0.912441 1.127602 41.606872 -289.26 145.36 35.700811 0.034449 0.003185 0.013761 0.040595 0.000024 0.000002 0.000009 0.000028 0.000024 0.000002 0.000009 0.000028 1.483260 -2.818298 3.730774 0.034449 0.003185 0.013761 0.040595 0.140271 2 1 27 175 80
4 0.657494 0.034540 0.594667 0.718106 0.035014 0.039616 0.001144 0.132781 0.026716 0.114738 0.997037 0.772262 0.077628 0.615685 0.907833 43.121633 -197.37 194.12 35.744727 0.028870 0.004379 0.013761 0.038531 0.000020 0.000003 0.000009 0.000027 0.000020 0.000003 0.000009 0.000027 1.501585 -3.242493 2.912903 0.028870 0.004379 0.013761 0.038531 0.149321 2 1 27 175 80

In order to decide what to do with the acceleration values, we will again build a correlation matrix, but this time only for the acceleration features.

corr = data_m1.loc[:, 'ACC_x_mean':'ACC_z_max'].corr()
mask = np.triu(np.ones_like(corr))

plt.figure(figsize=(10, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', mask=mask)
plt.show()

As we can see, the columns are highly correlated with each other, so we will merge the ACC_x, ACC_y and ACC_z features into ACC_mean_mean, ACC_mean_std, ACC_mean_min and ACC_mean_max.

# combine the ACC features for the three axis into one feature
data_m1['ACC_mean_mean'] = data_m1[['ACC_x_mean', 'ACC_y_mean', 'ACC_z_mean']].mean(axis=1)
data_m1['ACC_mean_std'] = data_m1[['ACC_x_std', 'ACC_y_std', 'ACC_z_std']].mean(axis=1)
data_m1['ACC_mean_min'] = data_m1[['ACC_x_min', 'ACC_y_min', 'ACC_z_min']].mean(axis=1)
data_m1['ACC_mean_max'] = data_m1[['ACC_x_max', 'ACC_y_max', 'ACC_z_max']].mean(axis=1)
data_m2 = data_m1.drop(['ACC_x_mean', 'ACC_y_mean', 'ACC_z_mean', 'ACC_x_std', 'ACC_y_std', 'ACC_z_std', 'ACC_x_min', 'ACC_y_min', 'ACC_z_min', 'ACC_x_max', 'ACC_y_max', 'ACC_z_max'], axis=1)
data_m2.head()
net_acc_mean net_acc_std net_acc_min net_acc_max EDA_phasic_mean EDA_phasic_std EDA_phasic_min EDA_phasic_max EDA_smna_mean EDA_smna_std EDA_smna_max EDA_tonic_mean EDA_tonic_std EDA_tonic_min EDA_tonic_max BVP_std BVP_min BVP_max TEMP_mean Resp_std Resp_min Resp_max 0_mean 0_std 0_min 0_max BVP_peak_freq subject label age height weight ACC_mean_mean ACC_mean_std ACC_mean_min ACC_mean_max
0 1.397968 0.141481 1.109299 1.678399 1.824289 1.088328 0.367977 4.319987 1.284376 1.952823 11.712596 1.232164 0.997487 -0.599164 2.554750 107.648359 -358.13 554.77 35.817091 2.935617 -8.805847 6.504822 0.029937 0.009942 0.000000 0.087383 0.135670 2 1 27 175 80 0.009850 0.003720 -0.008038 0.029168
1 1.210132 0.091882 1.014138 1.485800 2.109146 1.223528 0.539150 4.459367 1.467865 2.852510 17.418821 0.377615 1.172221 -1.213173 1.871490 118.742089 -392.28 438.16 35.797568 2.843123 -8.168030 6.742859 0.021986 0.015845 0.000000 0.071558 0.095023 2 1 27 175 80 0.005792 0.006949 -0.012632 0.023885
2 1.010977 0.102315 0.832216 1.190967 0.152828 0.128896 0.006950 0.544346 0.105091 0.244891 1.300810 1.727696 0.293389 1.137304 2.037179 42.190039 -240.61 209.89 35.712909 1.700333 -2.914429 3.260803 0.020839 0.011034 0.002752 0.054356 0.076880 2 1 27 175 80 0.006956 0.003683 0.000919 0.018144
3 0.775187 0.046391 0.693996 0.876819 0.177595 0.126167 0.002789 0.361388 0.110786 0.199704 1.105898 0.987927 0.042388 0.912441 1.127602 41.606872 -289.26 145.36 35.700811 1.483260 -2.818298 3.730774 0.034449 0.003185 0.013761 0.040595 0.140271 2 1 27 175 80 0.011499 0.001063 0.004593 0.013550
4 0.657494 0.034540 0.594667 0.718106 0.035014 0.039616 0.001144 0.132781 0.026716 0.114738 0.997037 0.772262 0.077628 0.615685 0.907833 43.121633 -197.37 194.12 35.744727 1.501585 -3.242493 2.912903 0.028870 0.004379 0.013761 0.038531 0.149321 2 1 27 175 80 0.009637 0.001462 0.004593 0.012861

TODO (revisit): we will temporarily remove the 0_mean, 0_std, 0_min and 0_max features, as their meaning is unclear.

data_m2 = data_m2.drop(['0_mean', '0_std', '0_min', '0_max'], axis=1)

Now we will plot the correlation of the features to the target variable.

# one hot encode the label column
data_m2_ohe = pd.get_dummies(data_m2, columns=['label'], prefix='', prefix_sep='')
corr = data_m2_ohe.corr()
corr_data = pd.DataFrame({'amusement': corr['0'], 'baseline': corr['1'], 'stress': corr['2']}) 
corr_data = corr_data.drop(['0', '1', '2'], axis=0).sort_values(by='baseline', ascending=False)

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(corr_data, vmin=-1, vmax=1, cmap="seismic", annot=True, ax=ax)
plt.show()

From the graph above we can conclude that the features with the highest correlation to the target variable are EDA_tonic, Resp and net_acc.

data_m2.label.unique()
array([1, 2, 0], dtype=int64)

We can see that there are three unique values in the label column; according to the WESAD data preparation repository, these values represent the following stress levels:

  • 0 - Amusement
  • 1 - Baseline
  • 2 - Stress
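This mapping can be kept in one place as a dictionary and applied with `Series.map` when readable labels are needed, e.g. for plot legends. A small sketch on a stand-in series:

```python
import pandas as pd

# label encoding used in this notebook (per the WESAD preparation repository)
label_names = {0: "amusement", 1: "baseline", 2: "stress"}

labels = pd.Series([1, 2, 0, 2])  # stand-in for data_m2['label']
print(labels.map(label_names).tolist())
```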

Now we will separate the data by each label and observe how the features are affected by different stress levels.

amusement_data = data_m2[data_m2['label'] == 0].reset_index(drop=True)
baseline_data = data_m2[data_m2['label'] == 1].reset_index(drop=True)
stress_data = data_m2[data_m2['label'] == 2].reset_index(drop=True)
def plot_data_per_label(data, feature, labels=['amusement', 'baseline', 'stress']):
    plt.figure(figsize=(20, 6))
    plt.title(f"{feature} variance per label")
    plt.xlabel("N records (30 second window)")
    plt.ylabel(f"{feature} measurement")

    # plot the feature for each stress level on a shared axis
    for d in data:
        plt.plot(d[feature])

    plt.legend(labels, loc='upper right')
    plt.show()

The features we will plot are the same ones we used for the correlation with the target variable, except subject, age, height and weight: these had a very low correlation with the label, so we would not observe any meaningful difference in their plots.

features = data_m2.columns.drop(['label', 'subject', 'age', 'height', 'weight'])
features
Index(['net_acc_mean', 'net_acc_std', 'net_acc_min', 'net_acc_max',
       'EDA_phasic_mean', 'EDA_phasic_std', 'EDA_phasic_min', 'EDA_phasic_max',
       'EDA_smna_mean', 'EDA_smna_std', 'EDA_smna_max', 'EDA_tonic_mean',
       'EDA_tonic_std', 'EDA_tonic_min', 'EDA_tonic_max', 'BVP_std', 'BVP_min',
       'BVP_max', 'TEMP_mean', 'Resp_std', 'Resp_min', 'Resp_max',
       'BVP_peak_freq', 'ACC_mean_mean', 'ACC_mean_std', 'ACC_mean_min',
       'ACC_mean_max'],
      dtype='object')

We can see that the categorical values have been successfully removed, so we will proceed to plot the effect of the stress level on the features.

Feature variance per label

from scipy.interpolate import make_interp_spline

# required inputs: data - a list of dataframes that correspond to the data for each label, feature - the feature to plot, labels - the labels for the data
def plot_smooth_data_per_label(data, feature, labels=['amusement', 'baseline', 'stress'], num=100, b_spline_degree=3):
    plt.figure(figsize=(20, 6))
    plt.title(f"{feature} variance per label")
    plt.xlabel("N records (30 second window)")
    plt.ylabel(f"{feature} measurement")

    # loop through the data for each stress level, smooth it and plot it
    for d in data:
        xnew = np.linspace(d[feature].index.min(), d[feature].index.max(), num)
        spl = make_interp_spline(d[feature].index, d[feature], k=b_spline_degree)
        power_smooth = spl(xnew)
        plt.plot(xnew, power_smooth)

    plt.legend(labels, loc='upper right')
    #plt.show()
    return plt.gcf()
stress_level_data = [amusement_data, baseline_data, stress_data]
from IPython.display import display, Markdown
for f in features:
    display(Markdown(f"## {f} feature"))
    plot = plot_smooth_data_per_label(stress_level_data, f, num=50, b_spline_degree=2)
    display(Markdown(matplotlib_fig_to_markdown(plot)))

[Smoothed per-label plots, one per feature: net_acc_mean, net_acc_std, net_acc_min, net_acc_max, EDA_phasic_mean, EDA_phasic_std, EDA_phasic_min, EDA_phasic_max, EDA_smna_mean, EDA_smna_std, EDA_smna_max, EDA_tonic_mean, EDA_tonic_std, EDA_tonic_min, EDA_tonic_max, BVP_std, BVP_min, BVP_max, TEMP_mean, Resp_std, Resp_min, Resp_max, BVP_peak_freq, ACC_mean_mean, ACC_mean_std, ACC_mean_min, ACC_mean_max]

Findings from feature variance

From the plots above we can clearly see that, in general, net_acc_std, Resp_max, Resp_std and all the EDA features are higher when a person is stressed. The ACC, TEMP, Resp_min and BVP features do not show any substantial difference between the stress levels. Additionally, there is no observable difference between the baseline and amusement levels in any of the features.
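These visual findings can be cross-checked numerically with a per-label mean; on the real data this would be `data_m2.groupby('label')[features].mean()`. A sketch on synthetic data with an EDA-like feature elevated under the "stress" label (2):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
toy = pd.DataFrame({"label": rng.integers(0, 3, size=n)})
# synthetic stand-in: add an offset of 1.0 whenever the label is "stress" (2)
toy["eda_tonic_mean"] = (rng.normal(0.0, 0.3, size=n)
                         + np.where(toy["label"] == 2, 1.0, 0.0))

per_label = toy.groupby("label")["eda_tonic_mean"].mean()
print(per_label)
```

A large gap between the label-2 row and the others, as in the plots, is what would justify the claim that the feature separates stress from the other states.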

Now we will use the SelectKBest method to select the best features, with the f_classif (ANOVA F-test) scoring function, as we are dealing with numerical inputs and a categorical output. We also drop the Resp features first, keeping only the signals available from the wrist-worn device (as the empE4 prefix in the code suggests).

from sklearn.feature_selection import SelectKBest, f_classif
empE4_features = features.to_list()
empE4_features.remove('Resp_std')
empE4_features.remove('Resp_min')
empE4_features.remove('Resp_max')
data_m3_X = data_m2[empE4_features]
data_m3_y = data_m2['label']
print(f"Old data shape: {data_m3_X.shape}")

# fit the selector once and reuse it for both the transform and the feature names
selector = SelectKBest(f_classif, k=5).fit(data_m3_X, data_m3_y)
data_m3_X_new = selector.transform(data_m3_X)
print(f"New data shape: {data_m3_X_new.shape}")

selected_features = data_m3_X.columns[selector.get_support()]
selected_features
selected_features
Old data shape: (1178, 24)
New data shape: (1178, 5)
Index(['net_acc_std', 'net_acc_max', 'EDA_tonic_mean', 'EDA_tonic_min',
       'EDA_tonic_max'],
      dtype='object')

We can observe that the selected features are net_acc_std, net_acc_max, EDA_tonic_mean, EDA_tonic_min and EDA_tonic_max. Now we will plot each of these features against the target variable per subject, to see whether the selected features show a clearer differentiation between the stress levels.

Data analysis by subject

def plot_smooth_data_per_label_per_subject(data, feature, subjects, labels=['amusement', 'baseline', 'stress'], num=100, b_spline_degree=3):
    plt.figure(figsize=(20, 20))

    for s in subjects:
        s_index = np.where(subjects == s)[0][0]
        plt.subplot(5, 3, s_index + 1)
        plt.title(f"{feature} variance per label for subject {s}")
        plt.xlabel("N records (30 second window)")
        plt.ylabel(f"{feature} measurement")
        
        data1 = [data[0][data[0]['subject'] == s].reset_index(drop=True), 
            data[1][data[1]['subject'] == s].reset_index(drop=True), 
            data[2][data[2]['subject'] == s].reset_index(drop=True)]

        # loop through the data for each stress level, smooth it and plot it
        for d in data1:
            xnew = np.linspace(d[feature].index.min(), d[feature].index.max(), num)
            spl = make_interp_spline(d[feature].index, d[feature], k=b_spline_degree)
            power_smooth = spl(xnew)
            plt.plot(xnew, power_smooth)
        plt.legend(labels, loc='upper right')

    plt.subplots_adjust(left=0.1,
                    bottom=0.1,
                    right=1,
                    top=0.9,
                    wspace=0.3,
                    hspace=0.4)
    return plt.gcf()
data_m3 = data_m2[[*selected_features.tolist(), 'label', 'subject']]
amusement_data = data_m3[data_m3['label'] == 0].reset_index(drop=True)
baseline_data = data_m3[data_m3['label'] == 1].reset_index(drop=True)
stress_data = data_m3[data_m3['label'] == 2].reset_index(drop=True)

for f in selected_features.tolist():
    display(Markdown(f"## {f} variance per label for all subjects"))
    plot = plot_smooth_data_per_label_per_subject([amusement_data, baseline_data, stress_data], feature=f, subjects=data_m2['subject'].unique())
    display(Markdown(matplotlib_fig_to_markdown(plot)))

[Per-subject smoothed per-label plots, one grid per selected feature: net_acc_std, net_acc_max, EDA_tonic_mean, EDA_tonic_min, EDA_tonic_max]

Conclusion

From observing the graphs above we can conclude that, for the EDA_tonic features, almost all the subjects (except subject 17) show noticeably higher and more fluctuating values when they are stressed. In the case of the net_acc_std feature, there are far more fluctuations in the readings of stressed subjects and the values are generally higher; the difference between the stressed and baseline periods can be seen even more clearly in the net_acc_max graphs, as there are no big oscillations. We can conclude that even a basic machine learning model, or a thresholding algorithm, should be able to classify the stress levels with high accuracy, given the clear separation between the features at different stress levels.
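The thresholding idea can be illustrated with a minimal sketch on synthetic data; the cut-off of 0.5 and the class separation below are hypothetical choices for demonstration, not values derived from WESAD:

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic EDA_tonic-like signal: non-stress centred at 0, stress centred at 1
non_stress = rng.normal(0.0, 0.25, size=200)
stress = rng.normal(1.0, 0.25, size=200)

x = np.concatenate([non_stress, stress])
y = np.concatenate([np.zeros(200), np.ones(200)])  # 1 = stress

threshold = 0.5                       # hypothetical cut-off between the classes
pred = (x > threshold).astype(float)  # classify as "stress" above the threshold
accuracy = (pred == y).mean()
print(f"threshold accuracy: {accuracy:.2f}")
```

With class distributions as well separated as in the per-subject plots, even this one-parameter rule achieves high accuracy, which supports the conclusion above.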